Introduction
Diabetes Mellitus (DM) is a major public health challenge, and risk factors such as obesity, age, race, and gender contribute to its development. Identifying risk factors (predictors) is essential for prevention and targeted intervention. Logistic regression is the standard tool for estimating the association between risk factors and binary outcomes, such as the presence or absence of diabetes. However, classical maximum likelihood estimation (MLE) can be unstable under small sample sizes, missing data, or quasi- or complete separation in the data.
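For a binary outcome $y_i \in \{0,1\}$ with predictor vector $x_i$, the standard logistic regression model is

\[
\Pr(y_i = 1 \mid x_i) = \frac{\exp(\beta_0 + x_i^\top \beta)}{1 + \exp(\beta_0 + x_i^\top \beta)}.
\]

Under quasi- or complete separation, some linear combination of the predictors (almost) perfectly distinguishes the two outcome classes, so the likelihood increases monotonically as one or more coefficients grow without bound; the MLE is then undefined or numerically unstable, yielding implausibly large coefficients and standard errors.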
Healthcare data span a wide variety of sources (DNA sequences, functional brain images, patient-reported outcomes, electronic health records (EHR), survey records, and sequences of health measurements, diagnoses, and treatments); they are complex, and standard approaches to analyzing them are often inadequate. Zeger et al. (2020) developed a Bayesian hierarchical model, fitted by MCMC, that combines prior knowledge with multivariate longitudinal patient data (EHR) to predict a patient's health status, trajectory, and likely benefit from intervention. Using R packages they developed, the model had two levels, (1) time within person and (2) persons within a population, with covariates, and combined exogenous variables (age, clinical history) and endogenous variables (current treatment) acting on the individual's multivariate health measurements. The model provided a posterior distribution, with estimates of the marginal distribution of the regression coefficients obtained by Markov chain Monte Carlo (MCMC) integration. It was used to identify low-risk patient populations in studies of pneumonia etiology in children, prostate cancer, and mental disorders. Because the model was entirely parametric (a model limitation), extensions to nonparametric or more flexible parametric models were recommended.
Chatzimichail and Hatjimihail (2023) compared parametric and nonparametric Bayesian inference for calculating the posterior probability of disease diagnosis, via two modules contrasting parametric distributions (with a fixed set of parameters) and nonparametric distributions (which make no a priori distributional assumptions). National Health and Nutrition Examination Survey data from two separate diagnostic tests, in both diseased and non-diseased populations, were used for model development. Conventional methods based on clinical criteria and fixed numerical thresholds limit the information captured about the intricate relationship between diagnostic tests and the varying prevalence of disease: the probability distributions of quantitative test outcomes overlap between diseased and non-diseased groups, and the dichotomous method fails to capture the complexity and heterogeneity of disease presentation across diverse populations. The authors also critique the conventional reliance on the normal distribution when data show skewness, bimodality, or multimodality. They reported that Bayesian nonparametric (versus parametric) diagnostic modeling allows flexible distributional modeling of test outcomes and of posterior disease probabilities. Their models, implemented in the Wolfram Language, integrated prior probabilities of disease with the test-outcome distributions in diseased and non-diseased populations, which enabled them to evaluate combined data from multiple diagnostic tests with improved diagnostic accuracy, precision, and adaptability; the models showed flexibility, adaptability, and versatility in diagnosis. Given the limited existing literature, they found nonparametric Bayesian models a better fit for the data distributions, and robust in capturing complex data patterns, producing multimodal disease-probability patterns rather than the bimodal, double-sigmoidal curves seen with parametric models. Reported limitations include reliance on parametric assumptions, limited scholarly publications, and over-dependence on prior probabilities, which increases uncertainty and yields broader confidence intervals for the posterior probabilities. They note that systematic bias (unrepresentative datasets), incomplete datasets, and the absence of normative data compromise the accuracy, reliability, and validity of Bayesian diagnostic methods, and that combining Bayesian inference with other statistical and computational techniques could enhance diagnostic capabilities (Chatzimichail and Hatjimihail, 2023).
To introduce the Bayesian methodology, the overview by Schoot et al. (2021) covers its stages, development, and advantages: the importance of priors, data modeling, inference, model checking and refinement, selecting a proper technique for sampling from the posterior distribution, variational inference, variable selection, and applications across research fields. They describe Bayesian statistics in different fields (social sciences, ecology, genetics, medicine) involving observed and unobserved parameters. They emphasize variable selection as the process of identifying the subset of predictors to include in a model, especially when a large number of potential predictors is available. Unnecessary variables introduce problems such as multicollinearity, insufficient samples, and overfitting of the current data, leading to poor predictive performance on new data and making model interpretation difficult. Variable selection is best performed after checking the correlations among the variables in the model (e.g., gene-to-gene interactions when predicting genes in biomedical research).
For small sample sizes, Bayesian estimation with mildly informative priors is often recommended. Priors are classified by the degree of (un)certainty their hyperparameters encode about the population parameter (informative, weakly informative, and diffuse); a prior distribution with a larger variance represents a greater amount of uncertainty. Priors can be elicited in different ways (from experts, from a generic expert, data-based, or from sample data using maximum likelihood or sample statistics). A prior sensitivity analysis that examines different forms of the model can assess how the priors and the likelihood align and how they affect posterior estimates, reflecting variation not captured by the prior or the likelihood alone. Prior specification allows data-informed shrinkage, regularizing or steering the algorithm toward a likely high-density region, and improves estimation efficiency. With a small sample, i.e., little information, incorporating priors strengthens the observed data and supplies plausible values for the unknown parameters; a careful probabilistic specification of the priors is therefore important for complex models with small sample sizes.

In Bayesian inference, the unknown parameters are random variables that can take a range of values, while the observed data are fixed. The likelihood is a function of θ for the fixed data y; it summarizes the statistical model that stochastically generates possible values of θ and the observed data y. Combining the priors with the likelihood of the observed data, the resulting posterior distribution provides estimates of the unknown parameters and captures the primary factors, improving our understanding. Monte Carlo techniques approximate integrals by sampling values from a given distribution in computer simulations. The R packages brms and blavaan provide interfaces to the probabilistic programming language Stan. MCMC algorithms require the probability distribution of interest to be specified only up to a constant of proportionality, scale to high dimensions, and yield empirical estimates of the posterior distribution of interest; Bayesian inference thus adopts a simulation-based strategy for approximating posterior distributions. Frequentists, in contrast, do not assign probability to the unknown parameters: the parameters are considered fixed, and the likelihood is the conditional probability distribution p(y|θ) of the data y given the fixed parameters θ. Spatial and temporal variability can be factored into Bayesian generalized linear models, and new data can be simulated conditional on the posterior distribution and assessed, providing valid predictions. Applied to large-scale cancer genomic data, the Bayesian approach identifies novel molecular changes in cancer initiation and progression and the interactions between mutated genes, captures mutational signatures and key genetic-interaction components, allows genomics-based patient stratification (clinical trials, personalized use of therapeutics), and aids understanding of cancer evolutionary processes. The authors also discussed model reproducibility and reporting standards and outlined a checklist. Limitations related to dependencies (autocorrelation of parameters over time in temporal models) and to the subjectivity of priors.
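As a minimal illustration of these ideas in R, a sketch assuming the brms package and a hypothetical data frame dat with a binary outcome, a weakly informative normal prior can be placed on the coefficients of a Bayesian logistic regression:

```r
# Minimal sketch: Bayesian logistic regression via brms (Stan backend).
# 'dat' is a hypothetical data frame with a binary outcome 'diabetes'
# and illustrative predictors 'age' and 'bmi'.
library(brms)

fit <- brm(
  diabetes ~ age + bmi,
  data   = dat,
  family = bernoulli(link = "logit"),
  prior  = set_prior("normal(0, 2.5)", class = "b"),  # weakly informative slopes
  chains = 4, iter = 2000, seed = 123
)

summary(fit)  # posterior means, 95% credible intervals, Rhat diagnostics
```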
The guidance on Bayesian inference by Klauenberg et al. (2015) emphasizes prior elicitation, analytical posteriors, and robustness checks, illustrated with Bayesian Normal linear regression, the core parametric (conjugate) model with a Normal-Inverse-Gamma prior, used in metrology to calibrate instruments and to evaluate inter-laboratory comparisons when determining fundamental constants.
In the Gaussian model, the errors are independent and identically distributed, the variance is unknown and estimated from the data, and the relationship between X and Y is statistical, with noise and model uncertainty, so the regression cannot be treated as a measurement function. The authors note that statistical approaches (likelihood, Bayesian, bootstrap, etc.) can quantify uncertainty in settings where the Guide to the Expression of Uncertainty in Measurement (GUM) and its supplements are not directly applicable. They advocate Bayesian inference because it accounts for a priori information and robustifies the analysis, through steps including prior elicitation, posterior calculation, and checks of robustness to prior uncertainty and of model adequacy, along with the assumptions critical to Bayesian inference.
In the Bayesian framework, all unknowns (observables, i.e., data, and unobservables, i.e., parameters and auxiliary variables) are treated as random and are assigned probability distributions representing the available information; prior knowledge about the unobservables is then updated with the information about them contained in the data. Graphical representation of the prior distribution and likelihood function, sensitivity analyses, and model checking enhance the elicitation and interpretation process.
For Normal linear regression: (1) the Normal-Inverse-Gamma (NIG) prior leads to a posterior from the same NIG family and is therefore a conjugate prior distribution for the coefficients and the unknown observation variance σ²; vague or non-informative prior distributions can be derived as limiting cases of the NIG prior. (2) Alternative families (hierarchical priors) assign an additional layer of distributions to uncertain prior parameters, or nonparametric priors may be used.
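Concretely, the conjugate setup can be written in standard notation (not the cited paper's exact parameterization) as

\[
y \mid \beta, \sigma^2 \sim \mathcal{N}(X\beta, \sigma^2 I), \qquad
\beta \mid \sigma^2 \sim \mathcal{N}(\mu_0, \sigma^2 V_0), \qquad
\sigma^2 \sim \mathrm{Inv\text{-}Gamma}(a_0, b_0),
\]

and the posterior is again NIG, with updated hyperparameters

\[
V_n = (V_0^{-1} + X^\top X)^{-1}, \quad
\mu_n = V_n (V_0^{-1}\mu_0 + X^\top y), \quad
a_n = a_0 + \tfrac{n}{2}, \quad
b_n = b_0 + \tfrac{1}{2}\bigl(y^\top y + \mu_0^\top V_0^{-1}\mu_0 - \mu_n^\top V_n^{-1}\mu_n\bigr).
\]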
Bayesian inference is influenced by:
- the uncertainty in transforming prior knowledge into prior distributions,
- the assumptions of the statistical model, and
- mistakes in data acquisition.
A Bayesian hierarchical / meta-analytic linear regression model was developed by de Leeuw and Klugkist (2012) to test a formal method for augmenting data in linear regression analyses by incorporating both exchangeable and unexchangeable information on regression coefficients (and standard errors) from previous studies.
They highlight that multiple testing results in relatively low statistical power, which is problematic in null-hypothesis significance testing: in multiple linear regression models with separate significance tests for all regression coefficients and modest sample sizes, different studies report different sets of statistically significant predictors, and addressing this with larger samples is practically unrealistic. Standard linear regression analyses do not account for summary statistics from similar previous studies, and ignoring past information on the parameters affects the stability and precision of the parameter estimates, which report lower values, are less certain, and are affected by sampling variation.
Their study conducted Bayesian linear regression accommodating prior knowledge, to compensate for the absence of formal prior studies and the practical difficulty of increasing the sample size: they augmented the data of a new study by incorporating priors on regression coefficients and standard errors from previous similar studies.
To address the limitations of univariate analysis and the difficulty of simultaneously combining multiple regression parameters per study, which ignores the relationships between the regression coefficients, Bayesian linear regression was combined with evidence on specific predictors from different linear regression analyses (meta-analysis). Adding summary statistics from previous studies provides an acceptable solution when the raw data of previous studies are not (realistically) obtainable.
Based on the information on predictors from previous and current data, the models are categorized as (1) exchangeable, when the current data and the previous studies have the same set of predictors, and (2) unexchangeable, when the predictors differ between the studies.
They emphasize the following steps (a sketch of the hierarchical specification follows the list):
- Calculate the probability density function of the data given the unknown model parameters (the standard multiple linear regression model), integrate the prior, and obtain the joint posterior density using the Gibbs sampler.
- Specify the prior, which quantifies what is assumed to be known about the model parameters before observing the data, together with the likelihood function of the data.
- Use a hierarchical model to analyze the parameters when studies are not exchangeable.
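A minimal sketch of such a hierarchical (unexchangeable) specification, in standard meta-analytic notation rather than the authors' exact parameterization:

\[
y_{ij} \mid \beta_j, \sigma_j^2 \sim \mathcal{N}(x_{ij}^\top \beta_j, \sigma_j^2), \qquad
\beta_j \sim \mathcal{N}(\mu, \tau^2 I), \qquad
\mu \sim \mathcal{N}(0, c^2 I), \qquad
\tau^2 \sim \mathrm{Inv\text{-}Gamma}(a, b),
\]

where j indexes studies. The study-specific coefficients β_j are drawn from a common population distribution, so studies inform one another without being forced to share identical coefficients, and a Gibbs sampler alternates draws from the full conditionals of β_j, μ, τ², and σ_j².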
They found that incorporating priors into a linear regression on new data yields significantly better parameter estimates with an adequate approximation, encouraging performance gains, especially for large effects.
The performance of the two versions (exchangeable and unexchangeable) of the replication model was consistently superior to using the current data alone.
The models using exchangeable and unexchangeable priors offer better parameter estimates in a linear regression setting without the need to expend large amounts of time and energy obtaining data from previous studies. The hierarchical unexchangeable model offers the additional advantage of addressing questions about differences between studies and thus allows explicit testing of the exchangeability assumption.
The stated limitations concern the requirement that studies share the same set of predictors and the effect of correlation between predictor variables (de Leeuw and Klugkist, 2012).
A Bayesian logistic regression (Bayesian GLM) model with a sequential clinical reasoning approach was developed in a longitudinal prospective cohort to predict the risk of incident cardiovascular disease (CVD). Three models were developed: (1) demographic features (basic model), (2) the six metabolic syndrome components (metabolic score model), and (3) conventional risk factors (enhanced model). The method applied logistic regression with priors on the coefficients and sequential updating to predict individual-level CVD risk.
In the study by Liu et al. (2013), the authors developed a model to overcome the limited availability of molecular information in clinical practice (high cost and unavailability), which hampers efficient disease diagnosis. They proposed an alternative approach that efficiently identifies a high-risk population from routinely measured biological markers before resorting to these expensive molecular tests.
Models such as the Framingham Risk Score were not sufficient because the observed heterogeneity (geographic and ethnic-group variation and social contextual networks) was often unobservable and unmeasurable and required the construction of separate models.
They developed and applied the sequential clinical reasoning model to subjects aged 20–79 years enrolled in a community screening program in Keelung City, Taiwan, followed for 5 years to identify incident cancers and chronic diseases (cardiovascular disease). The study classified incident CVD cases by incorporating (1) a standardized risk score (MetS components: fasting glucose, blood pressure, HDL-C, triglycerides, and waist circumference) and (2) risk factors: gender, heredity, smoking, alcohol drinking, family history, and betel quid chewing.
The methodology is a Bayesian clinical reasoning approach that develops the three models sequentially, emulating a clinician's evaluation process. It assigns normal distributions to the regression coefficients of all predictors, allowing for uncertainty in the clinical weights, and averages the credible intervals of the predicted risk estimates.
In the model, individual risk is elicited from a prior speculation (first impression) that is updated with objective observed data (the patient's history and laboratory findings); the regression coefficients used to compute the risk score are treated as random variables with normal distributions rather than as fixed values (as in traditional frequentist risk prediction models). Updating the prior distribution with the likelihood of the current data yields a posterior distribution for predicting the risk of a specific disease.
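This updating step is Bayes' theorem applied to the coefficient vector β (the generic form, not the cited study's exact notation):

\[
p(\beta \mid y) = \frac{p(y \mid \beta)\, p(\beta)}{\int p(y \mid \beta)\, p(\beta)\, d\beta} \;\propto\; p(y \mid \beta)\, p(\beta),
\]

where p(β) is the normal prior on the coefficients, p(y | β) is the logistic likelihood of the current data, and an individual's predicted risk is obtained by averaging the logistic probability over posterior draws of β.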
Comparing the three models, the enhanced model, which incorporated conventional risk factors, performed best. The proposed models predicted CVD incidence at the individual level and incorporated routine information within a sequential Bayesian clinical reasoning approach. Patients' backgrounds contributed significantly to baseline risk, and even with ecological heterogeneity the regression model adapted to individual characteristics and produced individual risk predictions for CVD incidence. The stated limitations are that interactions between predictors were not modeled and that the models should be cross-validated through external validation by applying them to new subjects not included in the training of the model parameters.
Bayesian multiple imputation with logistic regression was examined by Austin et al. (2021) for missing data in clinical research. Missing values, i.e., values not measured or recorded for all subjects, arise for varied reasons: (i) patient refusal to respond to specific questions; (ii) loss of patients to follow-up; (iii) investigator or mechanical error; (iv) physicians not ordering certain investigations for some patients. The study emphasizes understanding the type of missingness (MCAR, MAR, MNAR). Multiple imputation (MI) addresses missing data by generating multiple plausible values for each missing value of a given variable, creating multiple completed data sets; identical statistical analyses are conducted on each completed data set, and the pooled results from across the completed data sets are then analyzed. The study walks through the implementation and development of MI, including how many imputed data sets to create and how to handle derived variables, and illustrates MI by analyzing patients hospitalized with heart failure to estimate the probability of 1-year mortality in the presence of missing data (with code in R, SAS, and Stata) (Austin et al., 2021).
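A minimal sketch of this MI workflow in R with the mice package (variable names here are hypothetical, not the cited study's):

```r
# Multiple imputation, per-dataset logistic regression, and pooling with mice.
# 'hf' is a hypothetical data frame with missing values in some covariates.
library(mice)

imp <- mice(hf, m = 20, method = "pmm", seed = 2024)  # 20 imputed data sets

# Fit the same logistic regression to each completed data set ...
fits <- with(imp, glm(death_1yr ~ age + sbp + creatinine,
                      family = binomial(link = "logit")))

# ... and pool the estimates across data sets using Rubin's rules.
summary(pool(fits))
```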
Our study applies Bayesian logistic regression to a dataset with a quasi-separation issue, missing values, and the resulting small sample size. It is a retrospective study aimed at analyzing the relationship between diabetes status and key predictors (body mass index (BMI), age, gender, and race) using NHANES survey data collected in 2013–2014. The dataset initially contained 9813 observations on the 5 selected variables, but exploration revealed that the effective sample size was greatly reduced by complete-case analysis with listwise deletion of missing data. The small sample size and quasi-separation pose a challenge for traditional logistic regression: the complete-case model showed evidence of quasi-separation, with implausibly large coefficients and unstable estimates. This motivated us to apply multiple imputation (MICE) and conduct Bayesian logistic regression, which provide a flexible framework for modeling uncertainty and incorporating prior knowledge while avoiding the problems of separation and a small dataset.
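A condensed sketch of this pipeline in R, combining mice with brms (the NHANES variable names below are placeholders for the study's actual coding):

```r
# Sketch: impute NHANES missingness with mice, then fit Bayesian logistic
# regression with weakly informative priors on each completed data set.
library(mice)
library(brms)

imp  <- mice(nhanes_sub, m = 10, seed = 1)     # nhanes_sub: the 5 selected variables
dats <- complete(imp, action = "all")          # list of completed data sets

# brm_multiple() fits the model to every imputed data set and combines the
# posterior draws; normal(0, 2.5) priors on the slopes guard against separation.
fit <- brm_multiple(
  diabetes ~ bmi + age + gender + race,
  data   = dats,
  family = bernoulli(),
  prior  = set_prior("normal(0, 2.5)", class = "b")
)
summary(fit)
```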